[develop] Change conda env for AQM on Hera to shared one and fix input data issue on WCOSS2 #844

Merged
MichaelLueken merged 9 commits into ufs-community:develop from feature/aqm_conda on Jul 5, 2023

Conversation

chan-hoo
Collaborator

@chan-hoo chan-hoo commented Jun 24, 2023

DESCRIPTION OF CHANGES:

  • Switch the Python environment for AQM to the conda installation in the shared location on Hera.
  • Add the input data for AQM fire emission (TEST_AQM_INPUT_BASEDIR) to the shared directory for WE2E.
  • Fix the missing source attribute in AQM_ICS on WCOSS2 (Cactus) by updating the hash of AQM-utils.

TESTS CONDUCTED:

  • WE2E test for AQM on Hera and WCOSS2 (Cactus)
  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

CONTRIBUTORS:

@natalie-perlin

@MichaelLueken
Collaborator

@chan-hoo I'm currently working on running the aqm_grid_AQM_NA13km_suite_GFS_v16 WE2E test on Hera. Once it successfully completes, I will move forward with approving this PR.

I do have a question about the closing of issue #684. Should we keep this issue open until all machines have AQM-specific tasks, as WCOSS2, Cheyenne, and Hera do, and all use miniconda_regional_workflow_cmaq? It seems like this would need to be done before closing the issue. If you are okay with it, I would like to remove this issue from the issues that can be closed with this PR. Thanks!

@chan-hoo
Collaborator Author

@MichaelLueken, I agree with you. Since AQM is not available on other machines, you can close the issue once this PR is merged.

@MichaelLueken
Collaborator

@chan-hoo Since AQM can only run on Cheyenne, Hera, and WCOSS2, I will go ahead and keep the issue set as closable when this PR is merged.

@natalie-perlin
Collaborator

@chan-hoo , @MichaelLueken - is it only the issue with staged data for running the AQM on other machines?.. I'm still working on moving the AQM data to the rest of the systems, so it should not be an issue when finished.

Collaborator

@MichaelLueken MichaelLueken left a comment

@chan-hoo I was able to successfully run the aqm_grid_AQM_NA13km_suite_GFS_v16 WE2E test on Hera. I do have one concern before moving forward with approving this PR. I noted that Orion also has AQM tasks, but it is still using miniconda_online-cmaq rather than miniconda_regional_workflow_cmaq.

Please update the Orion tasks to use the new miniconda_regional_workflow_cmaq, then I can approve this PR.

@chan-hoo
Collaborator Author

> @chan-hoo , @MichaelLueken - is it only the issue with staged data for running the AQM on other machines?.. I'm still working on moving the AQM data to the rest of the systems, so it should not be an issue when finished.

@natalie-perlin, you don't have to hurry. Besides, AQM is not available on other machines yet.

@chan-hoo
Collaborator Author

@MichaelLueken, updated!

Collaborator

@MichaelLueken MichaelLueken left a comment

@chan-hoo Thank you very much for making these changes on Orion! Approving now.

@natalie-perlin
Collaborator

natalie-perlin commented Jun 30, 2023

@chan-hoo - what is needed to make AQM available on all the machines?
Is it only miniconda_regional_workflow_cmaq environment and data staged?

@chan-hoo
Collaborator Author

@natalie-perlin, I think so. The AQM input data are necessary and their paths should be defined in ush/machine/[machine].yaml. In addition, the module files in modulefiles/tasks/[machine]/[task] should be updated.
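
The two prerequisites named above (an AQM input data path in ush/machine/[machine].yaml and module files under modulefiles/tasks/[machine]/) could be checked with a small script. The following is only a rough sketch under stated assumptions: the key name TEST_AQM_INPUT_BASEDIR is taken from the PR description, while the helper names, the prefix-based module file matching, and the line-based YAML scan (used to avoid a third-party YAML parser) are illustrative choices, not part of the SRW App.

```python
import re
from pathlib import Path

# Key mentioned in the PR description; that every machine file uses this
# exact key is an assumption.
AQM_DATA_KEY = "TEST_AQM_INPUT_BASEDIR"


def machine_defines_key(yaml_text: str, key: str = AQM_DATA_KEY) -> bool:
    """Return True if `key` appears as an uncommented entry with a value."""
    pattern = re.compile(rf"^\s*{re.escape(key)}\s*:\s*(\S.*)$")
    for line in yaml_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # whole line is a comment
        match = pattern.match(line)
        if match and not match.group(1).startswith("#"):
            return True
    return False


def aqm_readiness(repo_root: Path, machine: str, tasks: list[str]) -> dict:
    """Report which AQM prerequisites are present for `machine` (hypothetical helper)."""
    machine_file = repo_root / "ush" / "machine" / f"{machine}.yaml"
    has_data_path = machine_file.is_file() and machine_defines_key(
        machine_file.read_text()
    )
    module_dir = repo_root / "modulefiles" / "tasks" / machine
    # Assumption: a task's module file name starts with the task name.
    missing = [
        task
        for task in tasks
        if not (module_dir.is_dir() and any(module_dir.glob(f"{task}*")))
    ]
    return {"data_path_defined": has_data_path, "missing_task_modules": missing}
```

Calling `aqm_readiness(Path("."), "orion", ["aqm_ics"])` from the top of a checkout would then list what is still missing for that machine; the task name passed in here is only a placeholder.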

@natalie-perlin
Collaborator

@chan-hoo - then it would be quite simple to update the rest of the platforms! I have already transferred data to most of the platforms, in ./UFS-SRW_data/aqm_data/, ./UFS_SRW_data/develop/aqm_data/, and ./UFS_SRW_data/develop/input_model_data/FV3GFS (if this is all that is needed), and the miniconda3 env still needs to be installed on any platforms that may be missing it.

Note about Orion:
I'm preparing hpc-stacks in a new dedicated space and installing miniconda3 for both Orion and Hercules, as well as moving all the UFS_SRW_data to a new location that is accessible with the epic role account:
/work/noaa/epic/role-epic

@chan-hoo
Collaborator Author

chan-hoo commented Jun 30, 2023

@natalie-perlin, the issue is that the input data of NEXUS and point source is huge (NEXUS > 100TB, point source > 100GB). You can check their size on Hera (/scratch2/NCEPDEV/naqfc/RRFS_CMAQ/emissions/nexus). I don't think AQM will need the entire directory of NEXUS, but you'll need help from ARL.

@natalie-perlin
Collaborator

@chan-hoo -
Yes, we probably won't need too much data to keep under epic account on all the machines. I just wonder if the subsets that you provided could be good enough to have AQM running, with the option not to use /scratch2/NCEPDEV/naqfc/RRFS_CMAQ/emissions/nexus data if certain directories are not set or found...
What is ARL?..
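
The fallback idea raised above (use a staged subset when the full NEXUS directory is not set or not found) might look roughly like the sketch below. It is an illustration only: the function name, the environment-variable lookup, and the candidate ordering are assumptions and do not describe how the workflow actually resolves its input paths.

```python
import os
from pathlib import Path
from typing import Iterable, Mapping, Optional


def first_available_dir(
    env_vars: Iterable[str],
    fallbacks: Iterable[str],
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[Path]:
    """Return the first candidate directory that is set and exists on disk.

    Candidates are tried in order: the values of `env_vars` first, then the
    literal paths in `fallbacks`. Returns None if nothing is usable.
    """
    environ = os.environ if environ is None else environ
    candidates = [environ.get(name, "") for name in env_vars]
    candidates += list(fallbacks)
    for candidate in candidates:
        if candidate and Path(candidate).is_dir():
            return Path(candidate)
    return None
```

A caller could pass a variable pointing at the full emissions directory first and a staged subset location as the fallback, then skip or fail the dependent task cleanly when the result is `None`.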

@chan-hoo
Collaborator Author

NOAA's Air Resources Laboratory. They maintain the nexus and point source directories.

@bbakernoaa
Contributor

@natalie-perlin @chan-hoo We have actually been trying to get the system up and running, and most of the input data is probably already transferred for the GEFS/UFS-Aerosol work in the fix directory. But we also maintained a backup on each system for development work. We have the emissions here on Orion: /work2/noaa/naqfc/bbaker/emissions/nexus

@natalie-perlin
Collaborator

@bbakernoaa - thank you for the info!

@natalie-perlin
Collaborator

natalie-perlin commented Jul 5, 2023

@bbakernoaa @chan-hoo
Are all the data under /work2/noaa/naqfc/bbaker/emissions/nexus/ on Orion, or is there another location for fixed files for the GEFS/UFS-Aerosol work?

Is it correct that we have all the data on Hera and Orion for running AQM? Cheyenne and other systems have [partial] data from Chan-Hoo in EPIC-maintained space in .../UFS_SRW/data/..., but we may need other data (nexus) as well.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Jul 5, 2023
@MichaelLueken
Collaborator

The coverage WE2E tests were manually run on Orion and all passed successfully:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
deactivate_tasks                                                   COMPLETE               1.37
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE             754.66
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta   COMPLETE             357.63
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot        COMPLETE             141.13
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta            COMPLETE              15.11
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_2017_gfdlmp  COMPLETE              10.26
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR              COMPLETE             386.83
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16   COMPLETE              29.32
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16    COMPLETE             285.62
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0         COMPLETE              13.45
nco                                                                COMPLETE               7.62
2020_CAD                                                           COMPLETE              30.80
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            2033.80

Awaiting successful completion of the rest of the automated tests before merging this work to develop.
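
A summary table like the one above can be checked mechanically. The sketch below is a hypothetical helper, not part of the WE2E tooling: it assumes each data row has the form `name STATUS hours` separated by whitespace, treats the `Total` row as the reported sum, and ignores the header and dashed rules.

```python
import re

# One data row: experiment name, upper-case status, core hours.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<status>[A-Z]+)\s+(?P<hours>\d+(?:\.\d+)?)\s*$"
)


def summarize(table_text: str):
    """Return (all_complete, summed_core_hours, reported_total)."""
    statuses, hours, reported = [], [], None
    for line in table_text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # header line or dashed rule
        if match["name"] == "Total":
            reported = float(match["hours"])
        else:
            statuses.append(match["status"])
            hours.append(float(match["hours"]))
    all_complete = all(status == "COMPLETE" for status in statuses)
    return all_complete, round(sum(hours), 2), reported
```

Applied to the Orion table above, the twelve experiment rows sum to 2033.80 core hours, which agrees with the reported total.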

@MichaelLueken
Collaborator

The automated Jenkins tests have successfully passed on Cheyenne, Hera, and Jet. With the success of the manual run on Orion, all tests have successfully passed. Given this and the two approvals for this work, I will now move forward with merging this work to develop.

@MichaelLueken MichaelLueken merged commit cff6f5c into ufs-community:develop Jul 5, 2023
@chan-hoo chan-hoo deleted the feature/aqm_conda branch February 14, 2024 18:31